# Audio-Visual Fusion

Videollama2.1 7B 16F Base

VideoLLaMA2.1 is an upgraded version of VideoLLaMA2 that focuses on improving spatio-temporal modeling and audio understanding in large video-language models.

Transformers English

Videollama2 72B

VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatio-temporal modeling. It accepts video and image inputs and can perform visual question answering and dialogue tasks.

Transformers English

Videollama2 8x7B

VideoLLaMA 2 is a multimodal large language model focused on video understanding and audio processing. It accepts video and image inputs and generates natural language responses.

Transformers English

Videollama2 7B 16F Base

VideoLLaMA 2 is a multimodal large language model focused on enhancing spatio-temporal modeling and audio understanding in video comprehension.

Transformers English

© 2025 AIbase